1. INTRODUCTION

The health sector in any country should aim at maintaining a healthy community. Advancement in
health care builds on previous research in the field. Researchers need to search, read, analyze, and explore
published documents in order to keep up with the progress made in the field. The number of archived
electronic documents keeps increasing, and these collections are becoming harder to organize and understand.
Dealing with this volume requires computational tools that organize document collections automatically
and support efficient searching and browsing.

Existing search techniques match the words in a query against the words in the documents and
return the documents that contain the query words. Because words have multiple meanings, matching
query words against documents is not enough to retrieve the documents that fit the user's intended
topic or meaning. Words in the same sentence should therefore be considered together rather
than separately. Researchers in machine learning and statistics have used hierarchical probabilistic models
called topic models to find patterns of words in a collection of documents.
These patterns reveal the topics contained in the documents. Hierarchical probabilistic models can be
applied to various kinds of data, including words, images, and survey information [1, 2].

A topic model is a statistical model that is used to discover the abstract topics in a
document collection. It can also be thought of as a form of text mining that extracts patterns of words from
textual material. There are various kinds of topic models, such as Latent Semantic Analysis (LSA),
Probabilistic Latent Semantic Analysis (PLSA), Latent Dirichlet Allocation (LDA), and the Correlated Topic
Model (CTM). LDA is the model used in this research.


1.1. Latent Semantic Analysis (LSA)

LSA is a natural language processing technique that investigates the relationships between a set of
documents and the terms they contain, based on the distributional hypothesis. A vector space is created that
contains word counts per paragraph for each text document. Singular value decomposition (SVD)
is then applied, and the resulting word vectors are compared to decide on the similarity between words.
LSA helps in finding information beyond the lexical level of word occurrences; it provides semantic
relations between words and documents [3, 4].

A statistical method for handling observed term-document association data was proposed in [5].
The authors assumed that the data has an underlying latent semantic structure that is obscured by the
randomness of word choice with respect to retrieval. They applied LSA in order to estimate this latent
structure and the noise of words. They created a semantic space for a large matrix of term-document
association data in which terms and documents that are closely associated are placed next to each other.


1.2. Probabilistic Latent Semantic Analysis (PLSA) 

PLSA is derived from a statistical view of LSA. It defines a generative data model that can be used
in information retrieval, machine learning, natural language processing, and related areas. PLSA was
proposed to address the weaknesses of LSA, which relies on the singular value decomposition of co-occurrence
tables. PLSA is instead based on a mixture decomposition derived from a latent class model: it associates a
latent context variable with each word occurrence, which takes polysemy into account. PLSA has two main
advantages: 1) perplexity minimization relative to a document-specific unigram baseline, and 2) automated
indexing of documents. Comparing the predictive performance of PLSA and LSA requires specifying how to
extract probabilities from the LSA decomposition. PLSA outperforms LSA in perplexity reduction
relative to the unigram baseline and shows improvements over LSA in a number of
experiments [6, 7].


1.3. Latent Dirichlet Allocation (LDA) 

LDA is a generative statistical model for collections of text data. It is a three-level hierarchical
Bayesian model in which each document of a collection is modeled as a mixture of topics. Each topic is
in turn modeled as a mixture over a set of topic probabilities. In text modeling, the topic probabilities provide
an explicit representation of a document [8]. LDA treats the words of a document as a bag of words
(that is, the order of the words in the document is not considered). The collection is represented by a
term-document matrix that contains the number of occurrences of each word in each document [1, 8].

The main idea is that documents are represented as random mixtures over latent topics, where each
topic is characterized by a distribution over words. LDA assumes the following generative process [8] for
each document w in a corpus D:

1) Choose the number of words N according to a Poisson distribution.
2) Choose a topic mixture θ for the document according to a Dirichlet distribution, θ ~ Dir(α).

A high value of α means that every document is likely to contain a mixture of most of the topics rather than
a single topic, whereas a low value of α means that a document is more likely to be represented by a mixture
of only a few of the topics. A high α therefore makes documents more similar to each other.

3) For each of the N words:
a) Choose a topic z_n according to a multinomial distribution with parameter θ.
b) Generate a word w_n according to a multinomial probability conditioned on the topic z_n.
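
The generative process above can be sketched in a few lines of Python (the research itself used R; Python is used here only for illustration). This is a minimal sketch under stated assumptions: the three toy topics, their word probabilities, the value of alpha, and the mean document length are all hypothetical, and the helper names are not taken from any library.

```python
import math
import random

def sample_poisson(rng, lam):
    """Draw from Poisson(lam) using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def sample_dirichlet(rng, alpha, k):
    """Draw from a symmetric Dirichlet(alpha) over k topics via normalized gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(rng, topics, alpha, mean_length):
    """Generate one document by following steps 1) to 3) of the generative process."""
    n_words = sample_poisson(rng, mean_length)           # step 1: N ~ Poisson
    theta = sample_dirichlet(rng, alpha, len(topics))    # step 2: theta ~ Dir(alpha)
    words = []
    for _ in range(n_words):                             # step 3: for each of the N words
        z = rng.choices(range(len(topics)), weights=theta)[0]   # 3a: topic z_n
        vocab, probs = zip(*topics[z].items())
        words.append(rng.choices(vocab, weights=probs)[0])      # 3b: word w_n given z_n
    return theta, words

# Hypothetical toy topics (word -> probability); not estimated from the paper's data.
TOPICS = [
    {"hypertension": 0.5, "blood": 0.3, "pressure": 0.2},
    {"heart": 0.5, "coronary": 0.3, "risk": 0.2},
    {"cholesterol": 0.5, "lipid": 0.3, "statin": 0.2},
]

rng = random.Random(0)
theta, words = generate_document(rng, TOPICS, alpha=0.5, mean_length=8)
print(theta, words)
```

Rerunning the sketch with a larger alpha (for example 10) tends to give topic mixtures that are closer to uniform, which matches the remark that a high α makes documents more similar to each other.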

A document is a probability distribution over topics, and a topic is a probability distribution over words.
Words that appear in the same document are related. The model generates a document by drawing the required
number of words from the selected topics and mixing them together, so every document is a collection of words
taken from different topics.

The model estimates a topic distribution for each document. The distribution has as many components as
the number of topics requested from the model, and each probability represents the fraction of words in
the document that originated from the corresponding topic.

The result of LDA is a file that lists every topic as a set of words, each with its probability of belonging to
the topic. Each document is then represented as a pattern of LDA topics.


1.4. Research Motivation 

Several studies have been conducted on health information systems and medical data [9-15].
Some researchers worked on the classification of different diseases, such as diabetes [11], Alzheimer's disease [12],
and cancer [13, 14], while others compared several classification and data mining algorithms on health data [15],
whether the data were in English, Arabic [16], or multilingual [17, 18]. Topic modeling has been applied
in many domains using different topic modeling approaches. The literature
contains many examples of researchers who used topic modeling, and especially LDA, for either text
classification [19, 20] or medical diagnosis [21, 22], and the method proved efficient [23, 24].

Because the documents in a medical collection are unclassified, and it is difficult to read each document
and determine its topic, classification should be done automatically using topic
models so that the documents on a specific topic can be retrieved. The main objective of this
research is to apply LDA to a collection of medical documents in order to classify them into three
main topics that are strongly related to each other.


2. RESEARCH METHOD 

This research follows a series of steps to classify the collection of documents using LDA
topic modeling and to study the performance of this technique on these documents. The steps are summarized
in Figure 1.


Figure 1. The main steps of the methodology


The phases followed in this research for classifying medical documents are as follows:


2.1. Data Collection

Medical articles have very long texts and contain many sections, such as the abstract, introduction,
materials, and other sections about the diseases and their treatments. In addition, they report many tests
with numbers and measurements that need to be recorded.

The collection used in this research was gathered from medical websites. The data set contains
500 medical articles collected from three websites: Medscape
(http://www.medscape.com), Hindawi (http://www.hindawi.com), and PubMed
(http://www.cbi.nlm.nih.gov/pubmed/). These websites provide free access to many articles. Each document
contains the abstract, conclusion, and keywords of one article. These sections were chosen because they
summarize the article and contain the important words on its subject.

The collected medical documents were chosen from three categories: Heart Diseases, Blood Pressure
(Hypertension), and Cholesterol (Hyperlipidemia). Of the 500 documents, 165 are about Heart Diseases,
181 are about Blood Pressure, and 154 are about Cholesterol.


2.2. Preprocessing and Cleaning

Removing irrelevant data from the documents is an important step for any model
[25] and improves the results. It is the most important step in text analysis, because unclean data has a
negative effect on the results. In this step, the documents collected in the previous step (saved as
Notepad text files) are cleaned, and the necessary preprocessing is done to make the documents
ready for use.

The preprocessing and topic modeling are implemented in R, using the RStudio environment.
R is one of the most powerful and popular free software environments. It is a
language and environment for statistical computing and graphics. R provides a wide variety of
statistical techniques, such as the t-test, classification, and clustering. Furthermore, R offers graphical
techniques and is highly extensible.

The tm package in R offers a number of transformations that ease the process of cleaning
data. For example, the corpus is cleaned using well-known operations such as removing numbers,
removing punctuation, removing special characters (@, #, %, ...), stripping extra whitespace, and removing
stop words (common English words such as “or”, “and”, and “the”). In addition, the tool allows the user to add a
list of words to the stop word list. For example, words from the collected data that carry the lowest
weight, such as “Abstract”, “Conclusion”, and “Keywords”, can be added [26].

The common English stop words from the http://www.ranks.nl/stopwords website are used here.
The 500 documents were passed to the tm package for preprocessing and cleaning. The result was a corpus of
158 distinct terms, which is used to classify the documents.
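
The cleaning operations listed above can be illustrated with a short Python sketch. It is not the tm pipeline used in this research: the stop word set is a tiny hypothetical stand-in for the full English list, the sample sentence is invented, and no stemming is applied.

```python
import re
import string

# Tiny illustrative stop word list (assumption); the research used a full English list.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "or", "the", "to", "was", "were", "with"}
# Section labels added to the stop word list, as described in the text.
CUSTOM_STOP_WORDS = {"abstract", "conclusion", "keywords"}

def clean_document(text, stop_words=STOP_WORDS | CUSTOM_STOP_WORDS):
    """Lowercase, drop numbers and punctuation, collapse whitespace, remove stop words."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)                                  # remove numbers
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)                                      # remove punctuation and special characters
    tokens = text.split()                                             # split() also collapses extra whitespace
    return [t for t in tokens if t not in stop_words]                 # remove stop words

sample = "Abstract: Blood pressure was measured in 120 patients, and the results are reported."
tokens = clean_document(sample)
print(tokens)  # ['blood', 'pressure', 'measured', 'patients', 'results', 'reported']
```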


2.3. Building the Document Term Matrix 

At this step, the Document-Term Matrix (DTM) is created: a matrix that lists the occurrences of
every word in each document of the corpus. In the DTM, the rows represent the documents (each row is labeled
with the document's name) and the columns represent the terms (words) of the documents. Each
entry is a count 0, 1, 2, 3, ..., n that gives the number of times the term in that column occurs in the
document in that row. If the entry in a row (e.g., Doc1) under a column (e.g., term1) is zero, the term
does not occur in that document; otherwise the entry is the frequency of that term in the document.
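
As a small illustration of this layout, the following Python sketch builds a DTM from already tokenized documents. The document names and tokens are hypothetical, and `build_dtm` is an illustrative helper rather than a function of the tm package.

```python
from collections import Counter

def build_dtm(tokenized_docs):
    """Return (vocabulary, matrix), where matrix maps a document name to its row of term counts."""
    vocabulary = sorted({term for tokens in tokenized_docs.values() for term in tokens})
    matrix = {}
    for name, tokens in tokenized_docs.items():
        counts = Counter(tokens)                                 # term frequencies in this document
        matrix[name] = [counts[term] for term in vocabulary]     # 0 when the term is absent
    return vocabulary, matrix

# Hypothetical tokenized documents.
docs = {
    "Doc1": ["blood", "pressure", "blood"],
    "Doc2": ["cholesterol", "lipid"],
}
vocabulary, dtm = build_dtm(docs)
print(vocabulary)   # ['blood', 'cholesterol', 'lipid', 'pressure']
print(dtm["Doc1"])  # [2, 0, 0, 1]
print(dtm["Doc2"])  # [0, 1, 1, 0]
```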

The terms of the matrix are listed and sorted by frequency. Words with low
frequencies were removed from the corpus in order to reduce the sparsity of the matrix, which
dropped from 99% to 82%. Furthermore, words that occur with high frequency in the corpus but are not
important for the classification process, such as “abstract” and “keyword”, were also removed. The remaining
158 terms from the corpus of 500 documents are used to classify the corpus into the suggested topics.
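
Sparsity here is the fraction of zero entries in the DTM. The sketch below, on a hypothetical 3 x 4 matrix, shows how dropping low-frequency terms lowers it. The threshold `min_total` is an assumption for illustration; the cut-off used in this research is not stated.

```python
def sparsity(matrix):
    """Fraction of entries in the matrix that are zero."""
    cells = [count for row in matrix for count in row]
    return sum(1 for c in cells if c == 0) / len(cells)

def drop_rare_terms(vocabulary, matrix, min_total):
    """Keep only the terms whose total count over all documents is at least min_total."""
    totals = [sum(column) for column in zip(*matrix)]
    keep = [j for j, total in enumerate(totals) if total >= min_total]
    return [vocabulary[j] for j in keep], [[row[j] for j in keep] for row in matrix]

# Hypothetical toy DTM: rows are documents, columns follow `vocabulary`.
vocabulary = ["blood", "lipid", "pressure", "statin"]
matrix = [
    [2, 0, 1, 0],
    [1, 0, 2, 0],
    [0, 1, 0, 1],
]
before = sparsity(matrix)                                       # 6 zeros out of 12 entries
kept_vocab, kept_matrix = drop_rare_terms(vocabulary, matrix, min_total=2)
after = sparsity(kept_matrix)                                   # 2 zeros out of 6 entries
print(before, kept_vocab, after)
```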


2.4. Applying Topic Modeling on the Medical Documents 

The DTM produced in the previous step is the input to this phase. To apply LDA topic
modeling, the topic modeling package is used and the number of topics is set to 3, because the
documents in the corpus were chosen from three different medical subjects and the goal is to
assign each document to its actual subject among the three diseases.
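
To make the fitting step concrete, here is a minimal Python sketch of LDA estimation by collapsed Gibbs sampling. This is an assumption made for illustration: the text does not state which estimation method the R package applied, and the hyperparameters `alpha` and `beta`, the iteration count, and the toy documents are all hypothetical. The function returns per-document topic probabilities (`theta`) and per-topic word probabilities (`phi`).

```python
import random

def fit_lda(docs, n_topics, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA on lists of tokens; returns (theta, phi, vocab)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_vocab = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]               # words in document d assigned to topic k
    topic_word = [[0] * n_vocab for _ in range(n_topics)]    # times word v is assigned to topic k
    topic_total = [0] * n_topics                             # words assigned to topic k overall
    assignments = []

    # Start from a random topic assignment for every word occurrence.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], assignments[d][i]
                # Remove the current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Resample the topic from its conditional distribution given all other assignments.
                weights = [
                    (doc_topic[d][t] + alpha) * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_vocab * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    theta = [
        [(doc_topic[d][t] + alpha) / (len(doc) + n_topics * alpha) for t in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(topic_word[t][v] + beta) / (topic_total[t] + n_vocab * beta) for v in range(n_vocab)]
        for t in range(n_topics)
    ]
    return theta, phi, vocab

# Hypothetical toy corpus of tokenized documents.
docs = [
    ["blood", "pressure", "hypertension", "blood"],
    ["heart", "coronary", "risk", "heart"],
    ["cholesterol", "lipid", "statin", "cholesterol"],
    ["blood", "pressure", "risk"],
]
theta, phi, vocab = fit_lda(docs, n_topics=3)
print(theta[0])
```

Each row of `theta` sums to 1 and plays the role of one row of document-topic probabilities, while sorting a row of `phi` in decreasing order gives the top terms of a topic.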

The output is three topics, Blood Pressure (Topic 1), Heart Disease (Topic 2), and Cholesterol
(Topic 3), each with associated terms that are related to that topic with different probabilities. Table 1
shows the top 10 terms associated with each topic. For example, documents in topic 1 have a
high probability of containing the words hypertension, study, and blood, while documents in topic 2 are more
likely to contain the words risk, disease, and cardiovascular.

It should be emphasized that each document is considered to be a mixture of all topics (three topics
in this research) and each topic contains all terms in the corpus with different probabilities. Table 2 shows the
probabilities assigned to the first 12 documents for the three topics. The table indicates that document number
1 belongs to topic 1 with a probability of 0.31, to topic 2 with a probability of 0.40,
and to topic 3 with a probability of 0.28.

















Table 1. The Top 10 Terms Related to Each Topic
Topic 1      Topic 2          Topic 3
hypertens    risk             cholesterol
studi        diseas           level
blood        cardiovascular   patient
pressur      heart            effect
patient      clinic           treatment
age          chd              increas
control      patient          therapi
signific     coronary         reduc
group        outcome          lipid
higher       medic            lower


Table 2. Assigning Documents Probabilities to the Topics
Topic 1     Topic 2     Topic 3
0.313131    0.40404     0.282828
0.295833    0.433333    0.270833
0.220238    0.264881    0.514881
0.276423    0.227642    0.495935
0.422572    0.186352    0.391076
0.307359    0.307359    0.385281
0.278867    0.206972    0.514161
0.280303    0.405303    0.314394
0.185535    0.279874    0.534591
0.290196    0.254902    0.454902
0.22807     0.259649    0.512281
0.435897    0.238095    0.326007








Medical documents classification using topic modeling (Maryam Nuser) 


1528 ISSN: 2502-4752


2.5. Medical Documents Classification 

After the topic terms are extracted, each document has three probabilities, one for each of
the three topics. The topic with the highest probability is chosen as the class of the document. As a result,
document number 1 is classified as topic 2, document number 2 is also classified as topic 2, and document
number 3 is classified as topic 3. These results are shown in Table 3 for the first 12 documents
of the corpus.
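
The assignment rule amounts to taking the index of the largest probability. A minimal Python sketch, using the first three rows of Table 3 as input (the helper name `assign_topic` is illustrative, and ties are resolved in favor of the lower topic number, which is an assumption):

```python
def assign_topic(probabilities):
    """Return the 1-based number of the topic with the highest probability."""
    best = max(range(len(probabilities)), key=lambda k: probabilities[k])
    return best + 1

# Probabilities of the first three documents, as listed in Table 3.
rows = {
    "1.txt": [0.313131, 0.40404, 0.282828],
    "10.txt": [0.295833, 0.433333, 0.270833],
    "100.txt": [0.220238, 0.264881, 0.514881],
}
assigned = {name: assign_topic(p) for name, p in rows.items()}
print(assigned)  # {'1.txt': 2, '10.txt': 2, '100.txt': 3}
```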


Table 3. List of the Documents Topic Assignments
Document    Topic 1     Topic 2     Topic 3     Assigned Topic
1.txt       0.313131    0.40404     0.282828    2
10.txt      0.295833    0.433333    0.270833    2
100.txt     0.220238    0.264881    0.514881    3
101.txt     0.276423    0.227642    0.495935    3
102.txt     0.422572    0.186352    0.391076    1
103.txt     0.307359    0.307359    0.385281    3
104.txt     0.278867    0.206972    0.514161    3
105.txt     0.280303    0.405303    0.314394    2
106.txt     0.185535    0.279874    0.534591    3
107.txt     0.290196    0.254902    0.454902    3
108.txt     0.22807     0.259649    0.512281    3
109.txt     0.435897    0.238095    0.326007    1


3. RESULTS AND ANALYSIS
3.1. Evaluation of the Accuracy of LDA 

Before the evaluation phase, the documents were sent to a medical expert in summarized
form. The expert classified each document as belonging to topic 1, 2, or 3. These classifications were used as
the baseline for evaluating the classifications produced by the LDA algorithm.

In the evaluation phase, the LDA classification results are compared with the
expert's classification to determine the accuracy of applying
this technique to the data set. This value represents the effectiveness of LDA topic modeling in classifying
medical text documents. Table 4 shows a sample of the comparison between the topics predicted by LDA and
those assigned by the expert for the first 12 documents.

The accuracy is measured using the confusion matrix shown in Table 5, which reports how
many documents are classified correctly and how many are misclassified. For example,
of the 181 documents in topic 1, only 112 were classified correctly, while 38 documents were
misclassified as topic 2 and 27 documents were misclassified as topic 3.


Table 4. A Comparison between the LDA Documents Classification and the Real Topic of the Documents
Document    Predicted Topic    Actual Topic    Match?
1.txt       2                  2               Yes
10.txt      2                  2               Yes
100.txt     3                  3               Yes
101.txt     3                  3               Yes
102.txt     1                  3               No
103.txt     3                  3               Yes
104.txt     3                  3               Yes
105.txt     2                  3               No
106.txt     3                  3               Yes
107.txt     3                  3               Yes
108.txt     3                  3               Yes
109.txt     1                  1               Yes





Table 5. The Confusion Matrix
                            Predicted Topic
Actual Topic    Topic 1    Topic 2    Topic 3    Total    Accuracy
Topic 1         112        38         27         181      61.8%
Topic 2         17         134        15         165      81.2%
Topic 3         17         29         111        154      72.1%
Total           146        201        153        500
Accuracy        76.71%     66.7%      72.5%               71.4%


\[
\text{Accuracy} = \frac{\text{\# of correctly classified documents}}{\text{Total number of documents}}
= \frac{112 + 134 + 111}{500} = \frac{357}{500} = 71.4\%
\]
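
The accuracy figures can be re-derived with a few lines of Python. The sketch uses only the diagonal counts and the stated row and column totals of Table 5; the off-diagonal counts as printed do not add up exactly to the stated row totals, so they are left out here. The variable names are illustrative.

```python
# Correctly classified documents per topic (diagonal of Table 5).
correct = {"Topic1": 112, "Topic2": 134, "Topic3": 111}
# Number of documents per actual topic (row totals of Table 5).
actual_totals = {"Topic1": 181, "Topic2": 165, "Topic3": 154}
# Number of documents per predicted topic (column totals of Table 5).
predicted_totals = {"Topic1": 146, "Topic2": 201, "Topic3": 153}

overall = sum(correct.values()) / sum(actual_totals.values())             # 357 / 500
per_actual = {t: correct[t] / actual_totals[t] for t in correct}          # Accuracy column
per_predicted = {t: correct[t] / predicted_totals[t] for t in correct}    # Accuracy row

print(f"overall accuracy: {overall:.1%}")  # overall accuracy: 71.4%
```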


3.2. Results Analysis 

The LDA model was applied to classify 500 documents into three topics: Topic 1 about Blood
Pressure or Hypertension, Topic 2 about Heart Diseases or Cardiovascular Diseases, and Topic 3 about
Cholesterol or Hyperlipidemia. As a result, 146 documents were assigned to Topic 1, 201 documents to Topic 2,
and 153 documents to Topic 3. The overall accuracy of the document classification was 71.4%.

LDA assigns each document a probability for every topic. All documents were medical documents, which
means that they have several words in common. In addition, the chosen diseases (topics) are related to each
other. As a result, some documents are likely to be misclassified. Moreover, in rare cases
LDA assigns probabilities that are close to each other. For example, document 5 is assigned to topic 1
with probability 42% and to topic 3 with probability 39%. As mentioned before, this is because the documents
have words in common and all belong to the same main category (medical documents). Additionally,
the preprocessing step strongly affects the process of extracting the topics.


4. CONCLUSION 

Because a large number of digital medical documents are not classified into specific subjects or
topics, and because each document has a long text and several sections, there is a need for
automated classification. An automated classification method reduces the time and effort needed for
classification compared with manual classification by a field expert.

One of the most common classification techniques is topic modeling. The LDA topic model is used in
this research to extract topics from the collected documents and assign each document to its most probable topic.
Five hundred documents were collected from medical websites. The documents were preprocessed,
and the results were fed to the LDA tool. Of the 500 documents in the collection, 357 were classified
correctly, which gives LDA an accuracy of 71.4%.

Future work should study another topic modeling technique, such as CTM, and
compare its performance with the results of the LDA model on this collection.
A further study could examine the effect of stop word removal on the results of the topic model by measuring
the accuracy of the classification before and after removing the stop words. Another idea is to
collect more documents and study whether the size of the
collection affects the extraction of topics and the classification of documents.